Abstract


Darwin claims in the Origin that similarity is evidence for common ancestry,
but that adaptive similarities are “almost valueless” as evidence. This
claim seems reasonable for some adaptive similarities but not for others. 
Here we clarify and evaluate these and related matters by using the law 
of likelihood as an analytic tool and by considering mathematical models 
of three evolutionary processes - directional selection, stabilizing selection, 
and drift. Our results apply both to Darwin’s theory of evolution and to 
modern evolutionary biology. 

Keywords: common ancestry, Darwin, drift, likelihood, natural selection. 


1 Introduction 

In the last paragraph of the Origin, Darwin (1859, p. 490) says that, in the
beginning, life was breathed “into a few forms, or into one.” The caution embodied in
“one or a few” is not to be found in present-day biology, which embraces the idea 
of universal common ancestry. Darwin tentatively reaches towards that stronger 
thesis a few pages earlier: 


... I believe that animals have descended from at most only four or 
five progenitors, and plants from an equal or lesser number. Analogy 
would lead me one step further, namely to the belief that all animals 
and plants have descended from some one prototype. But analogy 
may be a deceitful guide. Nevertheless all living things have much in 
common, in their chemical composition, their germinal vesicles, their 
cellular structure, and their laws of growth and reproduction. We see 
this even in so trifling a circumstance as that the same poison often 
similarly affects plants and animals; or that the poison secreted by 
the gall-fly produces monstrous growths on the wild rose or oak-tree. 
Therefore I should infer from analogy that probably all organic beings 
which have ever lived on this earth have descended from some one 
primordial form, into which life was first breathed. (Darwin 1859, p. 
484) 


Darwin’s idea that universal common ancestry is supported by the fact that “all 
living things have much in common” is an instance of a broader principle: when 
two or more taxa have trait X, this similarity favors the hypothesis of common
ancestry over the hypothesis of separate ancestry.

Darwin advances a second epistemological thesis about common ancestry -
that some similarities provide stronger evidence for common ancestry than others:


... adaptive characters, although of the utmost importance to the welfare
of the being, are almost valueless to the systematist. For animals
belonging to two most distinct lines of descent, may readily become
adapted to similar conditions, and thus assume a close external
resemblance; but such resemblances will not reveal - will rather tend to
conceal their blood-relationship to their proper lines of descent.
(Darwin 1859, p. 427)


On the next page, he gives the example of the “shape of the body and the fin-like
limbs” found in whales and fishes; these are “adaptations in both classes for
swimming through water” and thus provide almost no evidence that the two groups 
have a common ancestor. 

Although Darwin’s principle - that adaptive similarities provide scant evidence 
for common ancestry - sounds right when it is applied to this example, there are 
other examples in which it sounds wrong. Darwin describes one of them: 


The framework of bones being the same in the hand of a man, wing 
of a bat, fin of the porpoise, and leg of the horse - the same number 
of vertebrae forming the neck of the giraffe and of the elephant, - and 
innumerable other such facts, at once explain themselves on the theory 
of descent with slow and slight successive modifications. The similarity 
of pattern in the wing and leg of a bat, though used for such different 
purposes, - in the jaws and legs of a crab, - in the petals, stamens, 
and pistils of a flower, is likewise intelligible on the view of the gradual 
modification of parts or organs, which were alike in the early progenitor 
of each class. (Darwin 1859, p. 479) 


The shared “framework of bones” seems to be evidence for common ancestry, and 
yet this morphology seems to be useful in the different groups (Lewens 2015). So 
which epistemological principle is right - that all adaptive similarities provide only 
meager evidence for common ancestry, or that some adaptive similarities provide 
weak evidence while others provide strong? If the latter, how can the one sort of 
adaptive similarity be separated from the other? 

Darwin’s prose suggests an answer to this last question: perhaps the shared 
framework of bones is strong evidence for common ancestry because it is used for
different purposes in these different groups. This suggestion seems to separate
the torpedo-shape of whales and fish from the limb morphology of human beings, 
bats, porpoises, and horses. However, there is a more modern example that should 
give us pause about this proposal. Crick (1957) argued that the universality of 
the genetic code is strong evidence for common ancestry. Modern biology has 
retained his conclusion even though we now know that the genetic code isn’t 
universal; it is nearly universal, with almost all groups of organisms using one 
code and a few others using codes that are very similar, but not identical, to the 
one (Knight, Freeland, and Landweber 2001). The prevalent genetic code provides 
strong evidence for common ancestry even though it has the same purpose in all 
the living things that have it. 

2 The likelihood framework 

To sort out Darwin’s ideas concerning evidence for common ancestry, we need two 
concepts - one qualitative, the other quantitative. The former is provided by the 
law of likelihood (Hacking 1965): 

(Qual) Observation O favors hypothesis $H_1$ over hypothesis $H_2$ if and only if

\[ \Pr(O \mid H_1) > \Pr(O \mid H_2). \]

We will use this epistemological principle when we describe a fairly general set of 
assumptions in Section 3 that entails that 


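\[
\begin{aligned}
&\Pr(\text{taxa } A \text{ and } B \text{ have trait } X \mid A \text{ and } B \text{ have a common ancestor}) > \\
&\Pr(\text{taxa } A \text{ and } B \text{ have trait } X \mid A \text{ and } B \text{ do not have a common ancestor}).
\end{aligned}
\]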


Qual takes this inequality to mean that the similarity connecting A and B favors 
the hypothesis of common ancestry (CA) over the hypothesis of separate ancestry 
(SA). However, Qual does not provide the resources for describing how the type 
of evolutionary process affects the degree to which the similarity favors CA over 
SA. For this purpose we will use 

(Quant) The degree to which O favors $H_1$ over $H_2$ is given by the likelihood ratio

\[ \frac{\Pr(O \mid H_1)}{\Pr(O \mid H_2)}. \]

Given that taxa A and B share trait X, we will compare how strongly this
similarity favors CA over SA when M is the evolutionary process that governed the
evolution of X with how strongly the similarity favors CA over SA when N is the
process at work. This will involve comparing two likelihood ratios:

\[
\frac{\Pr_M(A \text{ and } B \text{ have trait } X \mid \mathrm{CA})}{\Pr_M(A \text{ and } B \text{ have trait } X \mid \mathrm{SA})}
>
\frac{\Pr_N(A \text{ and } B \text{ have trait } X \mid \mathrm{CA})}{\Pr_N(A \text{ and } B \text{ have trait } X \mid \mathrm{SA})}
\]


We consider which pairs of processes are related by this inequality in Sections 4-6. 

3 A sufficient condition for a similarity to favor 
common ancestry over separate ancestry 

Inspired by Reichenbach’s (1956) discussion of his principle of the common cause, 
we here describe a sufficient condition for a dichotomous trait shared by taxa A 
and B to favor the hypothesis that A and B have a common ancestor over the 
hypothesis that they do not. Figure 1 depicts the two hypotheses.



[Figure panels: Common Ancestry; Separate Ancestry]

Figure 1: Taxa A and B are observed to have trait X. Does this observation favor
Common Ancestry over Separate Ancestry? The probabilistic parameters a, d, b,
and e are explained in the text.


Notice that the states of A and B are described in the figure (A and B are 
both observed to have trait X), but the states of the ancestors postulated by the 
two hypotheses are not. The ancestors are represented by variables that take one 
of two values; the “+” value means that the postulated ancestor had trait X while
the “−” value means that the ancestor lacked X. Next to the branches in Figure
1 are lower-case letters that denote probabilities. These are 


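\[
\begin{aligned}
a &= \Pr(A \text{ has } X \mid C \text{ has } X) = \Pr(A \text{ has } X \mid S_1 \text{ has } X)\\
d &= \Pr(A \text{ has } X \mid C \text{ lacks } X) = \Pr(A \text{ has } X \mid S_1 \text{ lacks } X)\\
b &= \Pr(B \text{ has } X \mid C \text{ has } X) = \Pr(B \text{ has } X \mid S_2 \text{ has } X)\\
e &= \Pr(B \text{ has } X \mid C \text{ lacks } X) = \Pr(B \text{ has } X \mid S_2 \text{ lacks } X)
\end{aligned}
\]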


There is one more probability that we need, but it isn’t in the figure: 


\[ c = \Pr(C \text{ has } X) = \Pr(S_1 \text{ has } X) = \Pr(S_2 \text{ has } X) \]


The fact that the parameters that attach to the Common Ancestry hypothesis also 
attach to the Separate Ancestry hypothesis represents an assumption: 


Assumption 1 (cross-model homogeneity): The probability that a taxon 
has a trait does not depend on whether the Common Ancestry or the 
Separate Ancestry hypothesis is true. 


We add four more assumptions: 

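Assumption 2 (screening-off):

(i) $\Pr(A \text{ and } B \text{ have } X \mid \pm C) = \Pr(A \text{ has } X \mid \pm C)\,\Pr(B \text{ has } X \mid \pm C)$.

(ii) $\Pr(A \text{ and } B \text{ have } X \mid \pm S_1 \,\&\, \pm S_2) = \Pr(A \text{ has } X \mid \pm S_1 \,\&\, \pm S_2)\,\Pr(B \text{ has } X \mid \pm S_1 \,\&\, \pm S_2)$.

(iii) $\Pr(A \text{ has } X \mid \pm S_1 \,\&\, \pm S_2) = \Pr(A \text{ has } X \mid \pm S_1)$ and
$\Pr(B \text{ has } X \mid \pm S_1 \,\&\, \pm S_2) = \Pr(B \text{ has } X \mid \pm S_2)$.

Assumption 3 (non-extreme probabilities): $0 < a, b, c, d, e < 1$.

Assumption 4 (ancestor independence): $\Pr(\pm S_1 \,\&\, \pm S_2) = \Pr(\pm S_1)\,\Pr(\pm S_2)$.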

Assumption 5 (cross-branch homogeneity): (a - d) and (b - e) are either
both positive or both negative. The common ancestor’s having trait X
must make a difference (either positive or negative) in the probability
that one of its descendants will have trait X, and it must make a
difference of the same sign in the probability that the other descendant
will have trait X.

Notice that Assumption 5 is qualitative, not quantitative; it does not say that
a = b and d = e.

Simple algebra suffices to establish the following: 

If taxa A and B have trait X (where there are just two trait values) 
and assumptions 1-5 are true, then 

\[ \Pr(\text{taxa } A \text{ and } B \text{ have trait } X \mid \mathrm{CA}) > \Pr(\text{taxa } A \text{ and } B \text{ have trait } X \mid \mathrm{SA}). \]
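
The algebra can be sketched as follows, using the parameters a, b, c, d, e defined above. This is a worked restatement under Assumptions 1-4, offered as an illustration rather than as the authors' own proof:

```latex
\begin{align*}
\Pr(A \text{ and } B \text{ have } X \mid \mathrm{CA}) &= c\,ab + (1-c)\,de,\\
\Pr(A \text{ and } B \text{ have } X \mid \mathrm{SA}) &= [\,ca + (1-c)d\,]\,[\,cb + (1-c)e\,],\\
\Pr(A \text{ and } B \text{ have } X \mid \mathrm{CA}) - \Pr(A \text{ and } B \text{ have } X \mid \mathrm{SA}) &= c(1-c)(a-d)(b-e).
\end{align*}
```

Assumption 3 makes c(1 - c) positive, and Assumption 5 makes the product (a - d)(b - e) positive, so the difference is positive.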

It does not matter whether the evolutionary process at work in a branch is
selection or drift or whether the same process is at work in different branches. The five
assumptions are not a priori true, but they are very general; all the mathematical
models for the evolution of a dichotomous trait that biologists now use in
phylogenetic inference obey these assumptions (Lemey, Salemi and Vandamme 2009).

We note, in particular, that Assumption 5 holds for any Markov model of the 
evolution of a dichotomous trait since such models obey a “backwards inequality”: 

\[ \Pr\nolimits_t(\text{descendant has } X \mid \text{ancestor has } X) > \Pr\nolimits_t(\text{descendant has } X \mid \text{ancestor lacks } X), \]
for any finite amount of time t between ancestor and descendant (Sober 2008). This
inequality applies to each branch and so the cross-branch homogeneity assumption 
is satisfied. 
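
As an illustration of the backwards inequality, consider the special case of a continuous-time two-state chain. The rate symbols u (rate of gaining X) and v (rate of losing X) and the stationary probability $\pi = u/(u+v)$ are introduced here for the sketch and are not the paper's notation:

```latex
\begin{align*}
\Pr\nolimits_t(\text{descendant has } X \mid \text{ancestor has } X) &= \pi + (1-\pi)\,e^{-(u+v)t},\\
\Pr\nolimits_t(\text{descendant has } X \mid \text{ancestor lacks } X) &= \pi - \pi\,e^{-(u+v)t},\\
\text{difference} &= e^{-(u+v)t} > 0 \quad \text{for every finite } t \ge 0.
\end{align*}
```

The difference shrinks towards zero as t grows but never changes sign, which is what Assumption 5 requires of each branch.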

What happens if the evolving trait has n values $X_1, X_2, \ldots, X_n$? Assumptions
1-4 modify in a straightforward way, by simply replacing each of the two states ‘has
X’ (denoted by +) and ‘lacks X’ (denoted by −) by each of the n possible states
(so, for example, the modification of Assumption 2(ii) will represent $n^2$ statements
rather than just four when n > 2). However, the application of Assumption 5
merits spelling out. What is needed is this:

There exist states $X_i$, $X_j$ (distinct) and $X_k$ (possibly equal to $X_i$ or
$X_j$) so that changing the state of the common ancestor from $X_i$ to $X_j$
raises the probability that one descendant will be in state $X_k$. And for
any two distinct states $X_l$ and $X_m$, if changing the state of the common
ancestor from $X_l$ to $X_m$ raises the probability that one descendant will
be in state $X_k$, the change also will raise the probability that the other
descendant will be in state $X_k$.

In this case we have the following result, a brief proof of which is provided in the 
Appendix. 

Proposition 1 Under assumptions A1-A5, extended to allow n states, 

\[ \Pr(\text{taxa } A \text{ and } B \text{ are in state } X_k \mid \mathrm{CA}) > \Pr(\text{taxa } A \text{ and } B \text{ are in state } X_k \mid \mathrm{SA}). \]





Although the argument for Proposition 1 goes through, the fact that the evolving
trait isn’t dichotomous opens the door to possible violations of Assumption
5. An example is depicted in the accompanying table. In the branch leading to
taxon A, there is strong selection for $X_2$ if the common ancestor C is in state $X_1$,
but strong stabilizing selection prevents $X_2$ from evolving if C is in state $X_3$ and
also prevents $X_3$ from evolving if C is in state $X_2$. In the lineage leading to taxon
B, the situation is just the reverse: there is strong selection for $X_2$ if the ancestor
is in state $X_3$, but stabilizing selection prevents $X_1$ from evolving into $X_2$, and
also prevents $X_2$ from evolving into $X_1$. This difference in the processes governing
trait evolution in the two branches gives rise to the two probabilistic inequalities
described in the table; together, they violate Assumption 5. Suppose that
ancestors (in both the common ancestry and the separate ancestry models) have $X_1$,
$X_2$, and $X_3$ with probabilities 0.49, 0.02, and 0.49, respectively. The result is that
if taxa A and B are both in state $X_2$, this similarity will favor separate
ancestry over common ancestry. The likelihood of the common ancestry hypothesis is
approximately 0.02, whereas the likelihood of the separate ancestry hypothesis is
about $(0.51)^2$.



                              branch leading to taxon A          branch leading to taxon B

processes                     X_1 → X_2; no change               X_3 → X_2; no change
                              between X_2 and X_3                between X_1 and X_2

probabilistic inequalities    Pr(A has X_2 | C has X_1) >        Pr(B has X_2 | C has X_1) <
                              Pr(A has X_2 | C has X_3)          Pr(B has X_2 | C has X_3)
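
The arithmetic behind this counterexample can be checked with a short Python sketch. It idealizes "strong" selection and "strong" stabilizing selection as transition probabilities of exactly 1 or 0, which is an assumption made here for illustration (the text says only that the likelihoods are approximate). The dictionary and function names are hypothetical.

```python
# Prior probabilities of the ancestral states, as given in the text.
PRIOR = {"X1": 0.49, "X2": 0.02, "X3": 0.49}

# Pr(descendant is in X2 | ancestor state), idealized to 0 or 1 (assumption).
# Branch to A: X1 evolves into X2; X2 and X3 stay where they are.
TO_X2_BRANCH_A = {"X1": 1.0, "X2": 1.0, "X3": 0.0}
# Branch to B: X3 evolves into X2; X1 and X2 stay where they are.
TO_X2_BRANCH_B = {"X1": 0.0, "X2": 1.0, "X3": 1.0}


def likelihood_common_ancestry(prior, branch_a, branch_b):
    """Pr(A and B in X2 | CA): one shared ancestor, branches independent given its state."""
    return sum(prior[s] * branch_a[s] * branch_b[s] for s in prior)


def likelihood_separate_ancestry(prior, branch_a, branch_b):
    """Pr(A and B in X2 | SA): two independent ancestors, one per taxon."""
    pr_a = sum(prior[s] * branch_a[s] for s in prior)
    pr_b = sum(prior[s] * branch_b[s] for s in prior)
    return pr_a * pr_b


if __name__ == "__main__":
    ca = likelihood_common_ancestry(PRIOR, TO_X2_BRANCH_A, TO_X2_BRANCH_B)
    sa = likelihood_separate_ancestry(PRIOR, TO_X2_BRANCH_A, TO_X2_BRANCH_B)
    print(f"CA likelihood: {ca:.4f}, SA likelihood: {sa:.4f}, ratio: {ca / sa:.4f}")
```

Under these idealized values the common ancestry likelihood comes only from the case where the ancestor is already in $X_2$, while each separate ancestor contributes roughly 0.51, so the likelihood ratio falls well below 1.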


Our analysis in this section concerns two taxa. If there are more than two, the 
taxa can differ in how closely related they are to each other under the common 
ancestry hypothesis. We address this complication in Section 7. 

Our results so far lend support to Darwin’s intuition that similarity is evidence 
for common ancestry. The assumptions needed to derive this result aren’t a priori 
but they are very widely satisfied.³

4 The 1/p criterion and its limitations

We now turn to the question of which processes strengthen the evidence that a 
similarity provides for common ancestry and which processes weaken that evidence. 
We begin with a simple argument presented by Sober and Steel (2014a). The 
likelihood ratio of CA to SA can be expanded as follows: 

\[
\frac{\Pr_{CA}(A \text{ and } B \text{ have trait } X)}{\Pr_{SA}(A \text{ and } B \text{ have trait } X)}
=
\frac{\Pr_{CA}(A \text{ has trait } X \mid B \text{ has trait } X)\,\Pr_{CA}(B \text{ has trait } X)}{\Pr_{SA}(A \text{ has trait } X)\,\Pr_{SA}(B \text{ has trait } X)}
\]

³ For discussion of the relation of this argument to Reichenbach’s principle of the common
cause, and for examples outside of evolutionary biology in which similarity can be evidence
favoring a separate cause model over a common cause model, see Sober (2015).


If we use Assumption 1 above, that

\[ \Pr\nolimits_{CA}(B \text{ has trait } X) = \Pr\nolimits_{SA}(B \text{ has trait } X) = p \]

and assume further that the evolutionary process is uniform (meaning that
simultaneous branches have the same probabilities of changing state), so that

\[ \Pr\nolimits_{CA}(A \text{ has trait } X) = \Pr\nolimits_{CA}(B \text{ has trait } X), \]

the likelihood ratio becomes:

\[
\frac{\Pr_{CA}(A \text{ and } B \text{ have trait } X)}{\Pr_{SA}(A \text{ and } B \text{ have trait } X)}
=
\frac{\Pr_{CA}(A \text{ has trait } X \mid B \text{ has trait } X)}{p}
\tag{1}
\]

Suppose, finally, that if A and B have a common ancestor, then the amount of
time between A and B and their most recent common ancestor is very small. This
entails that $\Pr_{CA}(A \text{ has trait } X \mid B \text{ has trait } X)$ is close to 1, so the
likelihood ratio is approximately 1/p.

Given that this likelihood ratio gets bigger as p gets smaller, there is a simple 
argument for an implication of Darwin’s thesis that adaptive similarities provide 
little evidence for common ancestry. The argument does not describe the absolute 
amount of evidence that adaptive similarities provide, but it does say the following: 
if the value for p when X is adaptive is greater than the value for p when X is
neutral or deleterious, then neutral and deleterious similarities provide stronger
evidence for common ancestry than adaptive similarities do. We will see in what
follows that there are counterexamples to the consequent of the conditional just
stated; there are adaptive similarities that provide stronger evidence for common
ancestry than neutral similarities provide. Even so, the 1/p argument is a good
starting point.⁴

⁴ Imperfect approximations of the 1/p argument are presented in Sober (2008, pp. 297-305;
2011, p. 30).

The argument has two limitations. The first, that the argument is formulated
for just two taxa, will be removed in Section 7. The second limitation is the
assumption that if A and B have a common ancestor, they have a very recent
common ancestor. This limitation can be lifted by considering a Markovian
process of character state evolution. For computational convenience, in this paper we
consider the simplest model that allows different traits to have different
probabilities, namely the equal input model. This model, for an n-state character X, says
that for all states j, k different from i,
$\Pr(\text{descendant has } X_i \mid \text{ancestor has } X_j) = \Pr(\text{descendant has } X_i \mid \text{ancestor has } X_k)$
(Semple and Steel 2003).⁵ Except where
we consider directional selection, we will suppose the equal input model is
stationary. Any stationary equal input model involving any number of states entails the
following succinct representation of the likelihood ratio (Sober and Steel, 2014a):

\[ \mathrm{LR}_{CA/SA} = 1 + \frac{1-p}{p}\, e^{-2rt} \tag{2} \]

Here the most recent common ancestor of A and B postulated by the CA model 
is t units of time in the past, p is the stationary probability of trait X, and r is a 
scaled rate of substitution between states. Notice that the likelihood ratio in (2) is
greater than unity when t is any finite positive number (though it asymptotically
approaches unity as t is made large) and that the ratio is made large by making p
small. Later, we will see that Eqn. (2) also falls out as a corollary of Proposition 4.
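
The closed form in Eqn. (2) can be cross-checked against a direct sum over the state of the common ancestor. The sketch below assumes the stationary equal input transition probabilities stated in the paper, namely $p + (1-p)e^{-rt}$ for staying in X and $p(1 - e^{-rt})$ for arriving at X from any other state. The function names are hypothetical.

```python
import math


def stay_in_x(p, r, t):
    """Pr(descendant has X | ancestor has X) after time t."""
    return p + (1.0 - p) * math.exp(-r * t)


def arrive_at_x(p, r, t):
    """Pr(descendant has X | ancestor lacks X) after time t."""
    return p * (1.0 - math.exp(-r * t))


def lr_direct(p, r, t):
    """Likelihood ratio of CA to SA, summing over the common ancestor's state."""
    ca = p * stay_in_x(p, r, t) ** 2 + (1.0 - p) * arrive_at_x(p, r, t) ** 2
    sa = p ** 2  # two independent lineages, each in X with stationary probability p
    return ca / sa


def lr_closed_form(p, r, t):
    """Eqn. (2): 1 + ((1 - p) / p) * exp(-2 r t)."""
    return 1.0 + (1.0 - p) / p * math.exp(-2.0 * r * t)


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01):
        for t in (0.0, 0.5, 2.0):
            print(p, t, lr_direct(p, 1.0, t), lr_closed_form(p, 1.0, t))
```

At t = 0 both functions return 1/p, which is the limiting case used in the 1/p argument; as t grows the ratio decays towards 1.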


5 Directional selection versus drift 


Under CA, suppose that the root of the 2-taxon tree is in state X with probability
q, and that s is the stationary probability for state X. Thus the probability p
that a present day taxon is in state X lies between q and s, so either p = q = s
(neutrality) or q < p < s (selection for trait X) or s < p < q (selection against
trait X) when t > 0. In the equal input model, the probability of being in state X
if the process was in state Y at t units in the past is $s + (1 - s)e^{-rt}$ when Y = X
and $s(1 - e^{-rt})$ for Y ≠ X. Therefore:

\[ p = q\,[\,s + (1 - s)e^{-rt}\,] + (1 - q)\,[\,s(1 - e^{-rt})\,], \]

which simplifies to the relationship:

\[ p = s(1 - e^{-rt}) + q e^{-rt}. \tag{3} \]
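
The simplification can be spelled out by collecting the terms that do and do not carry the factor $e^{-rt}$ (a worked sketch of the intermediate step):

```latex
\begin{align*}
p &= q\,[\,s + (1-s)e^{-rt}\,] + (1-q)\,s\,(1 - e^{-rt})\\
  &= [\,qs + (1-q)s\,] + [\,q(1-s) - (1-q)s\,]\,e^{-rt}\\
  &= s + (q - s)\,e^{-rt}\\
  &= s(1 - e^{-rt}) + q\,e^{-rt}.
\end{align*}
```

The third line shows directly that p equals q at t = 0 and approaches s as t grows.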

For t small, p is close to q (with equality at t = 0) and for t large, p is close to
s (with equality in the limit as $t \to \infty$). It is convenient to think of s and q as
given, with p determined by these quantities (and r, t) via Eqn. (3). Consider now
the likelihood ratio of CA to SA under directional selection, which we denote by
$\mathrm{LR}^{DS}_{CA/SA}$. Applying Eqn. (3) in the denominator, we have:

\[
\mathrm{LR}^{DS}_{CA/SA} = \frac{q\,[\,s + (1 - s)e^{-rt}\,]^2 + (1 - q)\,[\,s(1 - e^{-rt})\,]^2}{[\,s(1 - e^{-rt}) + q e^{-rt}\,]^2}
\tag{4}
\]


⁵ For 4-state characters this is sometimes called ‘Felsenstein’s 1981’ model or the ‘Tajima-Nei’
equal input model.

If any two of q, s, p are equal then all three are, in which case directional selection
disappears, and from (4) we obtain the likelihood ratio of common ancestry to
separate ancestry when there is drift (D), which we denote by $\mathrm{LR}^{D}_{CA/SA}$. That is:

\[
\mathrm{LR}^{D}_{CA/SA} = \mathrm{LR}^{DS}_{CA/SA}\Big|_{s=q} = 1 + \frac{1-q}{q}\, e^{-2rt}.
\tag{5}
\]
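
The step from Eqn. (4) to Eqn. (5) can be sketched by setting s = q and writing $\varepsilon = e^{-rt}$ (the shorthand $\varepsilon$ is introduced here for readability):

```latex
\begin{align*}
\text{numerator of (4)} &= q\,[\,q + (1-q)\varepsilon\,]^2 + (1-q)\,q^2 (1-\varepsilon)^2\\
  &= q^3 + q^2(1-q) + q(1-q)\,\varepsilon^2\,[\,(1-q) + q\,]\\
  &= q^2 + q(1-q)\,\varepsilon^2,\\
\text{denominator of (4)} &= [\,q(1-\varepsilon) + q\varepsilon\,]^2 = q^2,\\
\mathrm{LR}^{D}_{CA/SA} &= 1 + \frac{1-q}{q}\,\varepsilon^2 = 1 + \frac{1-q}{q}\,e^{-2rt}.
\end{align*}
```

The terms linear in $\varepsilon$ cancel in the numerator, which is why only $e^{-2rt}$ survives.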


We now consider the ratio $\rho_{DS/D}(t)$ of two likelihood ratios. One of them is
the likelihood ratio for directional selection ($\mathrm{LR}^{DS}_{CA/SA}$) when q ≠ s; the other is
the likelihood ratio for drift ($\mathrm{LR}^{D}_{CA/SA}$) given by Eqn. (5). In both cases the two
taxa are observed to have trait X. Thus,

\[ \rho_{DS/D}(t) = \frac{\mathrm{LR}^{DS}_{CA/SA}}{\mathrm{LR}^{D}_{CA/SA}}. \]

Notice that the numerator of $\rho_{DS/D}(t)$ involves a non-stationary process while the
denominator characterizes a stationary process (with the stationary probability
of state X being equal to q). We assume q ≠ 0 since otherwise the probability
of observing trait X at the present, according to the equilibrium (drift) model,
is 0 (under either CA or SA). This framework allows us to derive the following
proposition, which describes how the difference between directional selection and
drift affects the degree to which a similarity favors common over separate ancestry:


Proposition 2

(i) For all t > 0, $\rho_{DS/D}(t) > 1$ if 1 > q > s (selection against the trait) and
$\rho_{DS/D}(t) < 1$ if q < s (selection for the trait).

(ii) If q = s or q = 1 then $\rho_{DS/D}(t) = 1$ for all t. Moreover, $\lim_{t \to \infty} \rho_{DS/D}(t)$
and $\rho_{DS/D}(0)$ both equal 1.

Figure 2 shows three graphs of $\rho_{DS/D}(t)$ - one compares selection with drift
when there is selection for trait X while the other two make the comparison when
there is selection against trait X (without loss of generality, we have taken r =
1). Proposition 2 says that these behaviours are generic and provides a formal
statement that accords with Darwin’s idea that adaptive characters provide less
support for CA than non-adaptive characters do. In fact, our result replaces
Darwin’s two types of similarity with three: In an equal input model of directional
selection, deleterious similarities are better than neutral similarities, and neutral
similarities are better than adaptive similarities in terms of how much they favor
CA over SA.

Figure 2: Three graphs of $\rho_{DS/D}(t)$, each of which compares directional selection
and drift. In each, an ancestor (t units in the past) has probability q of being
in state X. In two of the curves, selection is compared with drift when there is
selection against trait X, which is observed in the two leaves; in the third curve,
selection is compared with drift when there is selection for the trait found in the
leaves. The time axis shows the expected number of state changes under the drift
model.
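
The behaviour described in Proposition 2 can be explored numerically. The following Python sketch evaluates Eqns. (4) and (5) and their ratio at a few time points; the parameter values (q, s, r, t) are illustrative choices made here, not values taken from Figure 2, and the function names are hypothetical.

```python
import math


def lr_directional(q, s, r, t):
    """Eqn. (4): likelihood ratio of CA to SA under directional selection."""
    e = math.exp(-r * t)
    numerator = q * (s + (1 - s) * e) ** 2 + (1 - q) * (s * (1 - e)) ** 2
    p = s * (1 - e) + q * e  # Eqn. (3): probability that a leaf is in state X
    return numerator / p ** 2


def lr_drift(q, r, t):
    """Eqn. (5): likelihood ratio of CA to SA under drift, stationary probability q."""
    return 1 + (1 - q) / q * math.exp(-2 * r * t)


def rho_ds_d(q, s, r, t):
    """Ratio of the directional-selection likelihood ratio to the drift one."""
    return lr_directional(q, s, r, t) / lr_drift(q, r, t)


if __name__ == "__main__":
    times = (0.1, 0.5, 1.0, 2.0, 5.0)
    print("selection against X (q=0.5, s=0.1):",
          [round(rho_ds_d(0.5, 0.1, 1.0, t), 4) for t in times])
    print("selection for X (q=0.1, s=0.5):",
          [round(rho_ds_d(0.1, 0.5, 1.0, t), 4) for t in times])
```

With these illustrative values the first list stays above 1 and the second stays below 1, in line with part (i) of the proposition.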


6 Stabilizing selection versus drift


Suppose two taxa at the present share state X. Consider the following ratio of
likelihood-ratio values

\[ \rho_{SS/D}(t) = \frac{\mathrm{LR}^{SS}_{CA/SA}}{\mathrm{LR}^{D}_{CA/SA}} \]

under the equal input model on k states and two taxa. The numerator represents
the likelihood ratio of common ancestry to separate ancestry when there is
stabilizing selection (SS); the denominator represents that ratio when there is drift (D).
Suppose p is the stationary probability of state X, and that $r_D$ and $r_{SS}$ denote the
substitution rates under drift and stabilizing selection, where $r_D > r_{SS}$.⁶ From (2)
it follows directly that:

\[
\rho_{SS/D}(t) = \frac{1 + \frac{1-p}{p}\, e^{-2 r_{SS} t}}{1 + \frac{1-p}{p}\, e^{-2 r_{D} t}}.
\]

⁶ Here we depart from the usual conceptualization of stabilizing selection in which a population
has a bell-shaped distribution of some quantitative phenotype and the fitness of a trait value
monotonically increases as it gets closer to the population mean.

In this setting stabilizing selection inflates the likelihood ratio of CA over SA, and
when the states have equal probability, the maximal inflation in the likelihood
ratio grows according to a power law in the number of states:

Proposition 3

(a) For all t > 0, $\rho_{SS/D}(t) > 1$. Moreover, $\rho_{SS/D}(0) = 1 = \lim_{t \to \infty} \rho_{SS/D}(t)$,
and $\rho_{SS/D}(t)$ has a unique critical point at some value $t^* > 0$ where $\rho_{SS/D}(t)$
takes its global maximum value M.

(b) (i) If the substitution rate under drift is twice that for stabilizing selection,
then M can be stated as an explicit closed-form function of p. If, in
addition, all k states are equally probable, then $M = \frac{1}{2}(\sqrt{k} + 1)$.

(ii) More generally, if the substitution rate under drift is $\tau > 1$ times that for
stabilizing selection then we have the following asymptotic equivalence
as k becomes large: $M \sim C_\tau \cdot k^{1 - 1/\tau}$, where the term $C_\tau$ is independent
of k and is given by: $C_\tau = \left(\frac{\tau - 1}{\tau}\right) \cdot (\tau - 1)^{-1/\tau}$.

Notice in part (b)(ii) that as $\tau$ increases, the maximal value moves from being a
small power of k (e.g. square root when $\tau = 2$) towards linear growth in k (as
$\tau \to \infty$). Notice also that when $\tau = 2$ then $C_\tau \cdot k^{1 - 1/\tau} = \frac{1}{2}\sqrt{k}$ in agreement with
(b)(i). Figure 3 illustrates the behaviour of $\rho_{SS/D}(t)$ as a function of k and the
ratio $\tau = r_D / r_{SS}$.
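
The closed form in part (b)(i) can be checked numerically. The sketch below evaluates the ratio of the stabilizing-selection and drift likelihood ratios on a time grid and compares the largest value found with $\frac{1}{2}(\sqrt{k} + 1)$ for equally probable states (p = 1/k) and a drift rate twice the stabilizing-selection rate. The grid search, the parameter names, and the default grid size are assumptions made for this illustration, not the authors' method.

```python
import math


def rho_ss_d(p, r_ss, tau, t):
    """Ratio of LR under stabilizing selection to LR under drift.

    p: stationary probability of state X; r_ss: rate under stabilizing
    selection; tau: drift rate divided by r_ss (tau > 1); t: time to the
    common ancestor.
    """
    a = (1 - p) / p
    numerator = 1 + a * math.exp(-2 * r_ss * t)
    denominator = 1 + a * math.exp(-2 * tau * r_ss * t)
    return numerator / denominator


def max_rho_on_grid(p, r_ss, tau, t_max=10.0, steps=20000):
    """Largest value of rho_ss_d over an evenly spaced grid on (0, t_max]."""
    return max(rho_ss_d(p, r_ss, tau, i * t_max / steps) for i in range(1, steps + 1))


if __name__ == "__main__":
    for k in (10, 100, 1000):
        numeric = max_rho_on_grid(1 / k, 1.0, 2.0)
        closed_form = 0.5 * (math.sqrt(k) + 1)
        print(k, round(numeric, 4), round(closed_form, 4))
```

A grid search slightly underestimates the true maximum, so the two printed columns should agree only up to the grid resolution.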

It is useful to consider how this analysis of stabilizing selection versus drift 
is related to the 1/p argument described earlier. Consider equation (1), which
holds regardless of how much time the common ancestry hypothesis says there
is from taxa A and B back to their most recent common ancestor. If drift and
stabilizing selection assign the same value to p, how can the likelihood ratio of
CA to SA be greater when there is stabilizing selection than when there is drift?
The answer is that $\Pr_{CA}(A \text{ has } X \mid B \text{ has } X)$ has a higher value when there is
stabilizing selection than when there is drift. This is easily seen, since the models 
used for both processes are stationary and time-reversible. 
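
Under the stationary equal input model this can be made explicit. The sketch below relies on the time-reversibility just mentioned: reading the process from B back to the common ancestor and then forward to A gives a single path of total length 2t.

```latex
\begin{align*}
\Pr\nolimits_{CA}(A \text{ has } X \mid B \text{ has } X) &= p + (1-p)\,e^{-2rt},\\
\frac{\Pr_{CA}(A \text{ has } X \mid B \text{ has } X)}{p} &= 1 + \frac{1-p}{p}\,e^{-2rt}.
\end{align*}
```

The conditional probability decreases as r increases, so for the same p and t the lower rate under stabilizing selection gives the larger value, and dividing by p recovers the likelihood ratio for a stationary equal input model.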

Proposition 3 shows that Darwin overgeneralized when he said that adaptive
similarities provide almost no evidence for common ancestry. An adaptive
similarity provides stronger evidence than a neutral similarity when the adaptive
character evolves by stabilizing selection in an equal input model. It is arguable that
the “framework of bones” that Darwin discussed and the near universality of the
genetic code that we mentioned earlier each provide strong evidence for common
ancestry because their evolution was governed by stabilizing selection.

Figure 3: The impact of stabilizing selection versus drift on the likelihood ratio of
common ancestry versus separate ancestry for an equal input model with 10 states
(left) and 1000 states (right), where all states are equally probable. The rate of
leaving a state is 10%, 50% and 100% higher for drift than for stabilizing
selection. The time axis shows the expected number of state changes under stabilizing
selection.

7 Going beyond two taxa 

In Proposition 1 we described a very general sufficient condition for a trait shared
by two taxa to favor common ancestry over separate ancestry. Here we address a
complication that arises when more than two taxa are considered. The
complication is that if more than two taxa have a common ancestor, there are different tree
topologies that might connect those taxa to each other. Thus, the hypothesis of
common ancestry is a disjunction in which each disjunct is a different tree
topology. How is the likelihood of this disjunction to be compared with the likelihood
of the separate ancestry hypothesis when n leaf taxa all have the same trait value?
We address this question by identifying the tree topology that has the highest
likelihood and the one that has the lowest likelihood; this means that the likelihood
of the common ancestry disjunction must fall somewhere in between.

Suppose n taxa share state X, and under CA have a most recent common
ancestor t time units in the past. Assume an equal input model of character
state change in equilibrium (i.e. drift or stabilizing selection, but not directional
selection) in which state X has stationary probability p ≠ 0. We consider two
extreme scenarios for the tree linking these n taxa under CA:

• Star tree: This tree has all n leaves adjacent to the root vertex, with edges
of temporal length t. For this tree it is readily verified that:

\[
\mathrm{LR}_{CA/SA} = p\left(1 + \frac{1-p}{p}\, e^{-rt}\right)^{n} + (1 - p)(1 - e^{-rt})^{n}
\tag{6}
\]

• Delayed tree: Consider a tree that has two edges both of length t, connecting
the root to two leaves. If each of the remaining n - 2 leaves is attached
to one or other of these leaves by edges of length zero, then we obtain a tree
we call a ‘delayed tree’. For a delayed tree it is readily verified that:

\[
\mathrm{LR}_{CA/SA} = \frac{p + (1 - p)\, e^{-2rt}}{p^{\,n-1}}
\]

Notice that when n = 2, simple algebra shows that the two expressions on the right
for $\mathrm{LR}_{CA/SA}$ agree, and equal the expression in Eqn. (2), which is to be expected
since for two leaves the star and delayed tree are identical, and this is the only
tree shape possible. Figure 4 illustrates the star and delayed trees, on either side
of a ‘typical’ binary phylogenetic tree (note that (c) shows only an approximation 
to a delayed tree, since edges of length zero are difficult to see!). 





Figure 4: (a) a star tree (b) a binary tree (c) a delayed tree 


Proposition 4 For any fixed parameters (p, r, t), where p ≠ 0, 1, and r, t > 0,
the likelihood ratio $\mathrm{LR}_{CA/SA}$ is minimized when the underlying tree (under CA) is
a star tree, and it is maximised by any delayed tree.

This result has a number of immediate consequences:

1. $\mathrm{LR}_{CA/SA} > 1$ for all finite t > 0. That is, a shared trait always favors CA
over SA under the equal input model.

2. When n > 2 this proposition (and the formula (6) for LR for the star tree)
improves on the lower bound of Sober and Steel (2014a, Proposition 1),
namely

\[ \mathrm{LR}_{CA/SA} > 1 + [\ldots]\; e^{-nrt}, \]

which grows exponentially with n when $p < e^{-rt}$. However, the star bound
is better since it grows exponentially with n regardless of the size of p ≠ 0, 1.
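
The two extreme cases can be compared numerically. The Python sketch below implements the star-tree expression (6) and a delayed-tree expression of the form $(p + (1-p)e^{-2rt}) / p^{n-1}$, which is how the delayed-tree formula is read here from the model's transition probabilities; the function names and parameter values are hypothetical choices for illustration.

```python
import math


def lr_star(n, p, r, t):
    """Star tree: n leaves joined directly to the root by edges of length t."""
    e = math.exp(-r * t)
    return p * (1 + (1 - p) / p * e) ** n + (1 - p) * (1 - e) ** n


def lr_delayed(n, p, r, t):
    """Delayed tree: two edges of length t; the other n - 2 leaves attach by zero-length edges."""
    return (p + (1 - p) * math.exp(-2 * r * t)) / p ** (n - 1)


def lr_two_taxa(p, r, t):
    """Two-taxon likelihood ratio for the stationary equal input model."""
    return 1 + (1 - p) / p * math.exp(-2 * r * t)


if __name__ == "__main__":
    p, r, t = 0.2, 1.0, 1.0
    for n in (2, 3, 5, 10):
        print(n, round(lr_star(n, p, r, t), 4), round(lr_delayed(n, p, r, t), 4))
    print("two-taxon value:", round(lr_two_taxa(p, r, t), 4))
```

For n = 2 the two functions coincide with the two-taxon value, and for larger n the star value stays below the delayed value while both exceed 1, as Proposition 4 and its first consequence lead one to expect.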

The result for the star tree in Proposition 4 might seem completely as expected.
Some caution is in order, however, since a related question led to the surprising 
finding that the star tree is the ‘extreme case’ for certain types of equal input 
models, but not for others. More precisely, the star tree maximizes the mutual 
information between the states at the leaves and the root state for an equal input 
model on two equally probable states (Evans et al. 2000). However, the star tree 
can fail to maximize mutual information when the equal input model has five or 
more equally probable states, provided the number of leaves is sufficiently large, 
and the branch lengths lie in a certain range (Sly, 2011). 


8 Conclusions 

The idea that similarity is evidence for common ancestry has exceptions, but it 
holds in a very general circumstance, which we described by enumerating five
assumptions. Three of these are familiar from the literature on causal modeling:
intermediate probabilities, screening-off, and ancestor independence (Spirtes,
Glymour, and Scheines 2000; Pearl 2009). Two further assumptions are more specific
to the literature on phylogenetic inference: cross-model homogeneity and
cross-branch homogeneity. We noted that the last of these is not an inevitable
consequence of evolutionary theory. If it is violated, a similarity can favor separate
ancestry over common ancestry. And even when the evolving trait obeys the five
assumptions, tree topology complicates the likelihood comparison of common
ancestry and separate ancestry; we explained how an equal-input model permits the
comparison to go forward when there are more than two leaf taxa. 

Turning to the question of which similarities provide stronger evidence for
common ancestry than which others, we began with a simple argument for the following
thesis: the sharing of trait X among two leaf taxa provides stronger evidence for
common ancestry the less probable it is that a taxon has trait X. This is the 1/p
argument of Section 4. The main limitation of this argument is that it assumes
that if two taxa have a common ancestor, their most recent common ancestor was
in the very recent past. Our analysis in Sections 5 and 6 dropped that
assumption, but our results show that the 1/p argument was on the right track, at least in
part. When the selection process is directional selection, deleterious similarities are
better than neutral similarities, and neutral similarities are better than adaptive
similarities. Darwin’s comment that an adaptive similarity is “almost valueless”
is correct if the adaptive similarity is due to directional selection. However, when
stabilizing selection is considered instead of directional selection, the situation is
more subtle.

Under an equal-input model, adaptive similarities that are the product of
stabilizing selection are better than neutral similarities. This means that the 1/p
argument is mistaken in this instance, since stabilizing selection and neutral
evolution can assign the same probability to a taxon’s having trait X. Our results
concerning the impact of different evolutionary processes on the ratio of the
likelihoods of common ancestry and separate ancestry are summarized in Figure 5.

Although the 1/p argument goes wrong in judging that stabilizing selection and
drift are in the same boat when they assign the same probability to a taxon’s
having trait X, reducing the value of p still plays a role in comparing these two
processes. The maximal extent to which stabilizing selection can favor CA over
SA, compared with drift, depends on p; Figure 3 makes this plain for the special
case of equally probable states. In that case, p = 1/k (where k is the number
of states) and the maximum degree to which stabilizing selection favors CA over
SA, compared with drift, becomes large as p becomes small. This maximal ratio
is described by a $1/\sqrt{p}$ relationship (when the drift substitution rate is twice the
stabilizing selection substitution rate) but moves closer to a 1/p relationship as the
ratio of these two substitution rates grows.

Notice also that in both Figure 2 and Figure 3, the likelihood advantage of one
process over another sets in when t > 0 and disappears as t approaches infinity.
This is a pattern that should be expected in Markov processes that allow
transitions from any state to any other state by some sequence of steps (Sober and Steel
2014b).

Although we used an equal-input model to represent stabilizing selection, the
fact remains that stabilizing selection does not require an equal-input model. This
raises the question of how the difference between stabilizing selection and neutral
evolution would affect the likelihood comparison of common and separate ancestry
were stabilizing selection reconceptualized. For example, consider an ordered set
of character states $X_1, X_2, \ldots, X_n$, where the probability of evolving from $X_i$ to
$X_j$ depends on the value of $|i - j|$: the bigger the difference between $i$ and
$j$, the smaller the value of Pr(descendant has $X_j$ | ancestor has $X_i$). This ordering

' It is worth comparing this figure with Figure 3 in Sober and Steel (2014b), where the 
problem wasn’t evidence for common ancestry, but the question of how much information the 
present state of a lineage provides about its ancestral state. Sober and Steel use a Moran model 
framework to represent different evolutionary processes and take the present state of the lineage 
to be the frequency of an organismic trait. 
[Figure 5 (diagram): a vertical arrow runs from "larger" (top) to "smaller" (bottom).
Top: directional selection against trait X; stabilizing selection. Middle: drift.
Bottom: directional selection for trait X.]

Figure 5: A partial ordering of the likelihood ratios of Common Ancestry versus
Separate Ancestry under four evolutionary processes. In each case, two taxa are
observed to share trait X.

constraint reflects a type of stabilizing selection. The equal-input model provided a 
tidy solution to the problem we posed, but other models, like the one just described, 
need to be explored as well. This caveat generalizes: it is worth considering how 
the epistemological significance of different types of similarity is affected by varying 
model assumptions. The present paper is not the end of the story. 

9 Appendix 

Proof of Proposition 1. The proof hinges on the following result (a general version
of Reichenbach’s theorem).

Lemma 5 Suppose two events $E_1$, $E_2$ and a third variable $C$ take values in some
discrete set $S$ of states. For any $x \in S$, let $C_x$ be the event that $C = x$. Suppose
further that the following three conditions hold:

(i) $E_1$ and $E_2$ are conditionally independent given $C_x$, for each $x \in S$;

(ii) $\Pr(E_1 \mid C_x) > \Pr(E_1 \mid C_y) \Rightarrow \Pr(E_2 \mid C_x) > \Pr(E_2 \mid C_y)$ for all $x, y \in S$;

(iii) $\Pr(E_1 \mid C_x) \neq \Pr(E_1 \mid C_y)$ for some $x, y \in S$ with $\Pr(C_x) > 0$ and $\Pr(C_y) > 0$.

Then $\Pr(E_1 \& E_2) > \Pr(E_1)\Pr(E_2)$.

Proof: By a standard trick, a short proof is possible thanks to the following
convenient equation (which follows from assumption (i) and algebra):

$$\sum_{x \in S,\, y \in S} \Delta_1(x,y)\,\Delta_2(x,y)\,\Pr(C_x)\Pr(C_y) = 2\,[\Pr(E_1 \& E_2) - \Pr(E_1)\Pr(E_2)] \qquad (7)$$

where $\Delta_i(x,y) = \Pr(E_i \mid C_x) - \Pr(E_i \mid C_y)$, coupled with the observation that each
summand in (7) is non-negative (by (ii)), and therefore the sum is strictly positive
(by (iii)). □
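Identity (7) can also be checked numerically. The following is a minimal sketch, not part of the original argument: the function name, the number of states, and the randomly drawn probabilities are all illustrative assumptions. It builds a small example in which conditions (i)-(iii) hold and compares the two sides of Eqn. (7).

```python
import itertools
import random


def both_sides_of_eq7(pi, a, b):
    """Return (lhs, rhs) of Eqn. (7).

    pi[x] = Pr(C_x), a[x] = Pr(E1 | C_x), b[x] = Pr(E2 | C_x).
    """
    states = range(len(pi))
    lhs = sum((a[x] - a[y]) * (b[x] - b[y]) * pi[x] * pi[y]
              for x, y in itertools.product(states, repeat=2))
    pr_e1 = sum(pi[x] * a[x] for x in states)
    pr_e2 = sum(pi[x] * b[x] for x in states)
    # Condition (i): Pr(E1 & E2 | C_x) = Pr(E1 | C_x) * Pr(E2 | C_x).
    pr_both = sum(pi[x] * a[x] * b[x] for x in states)
    return lhs, 2 * (pr_both - pr_e1 * pr_e2)


random.seed(0)  # arbitrary seed, used only for reproducibility
n_states = 4    # hypothetical number of states
weights = [random.random() + 0.1 for _ in range(n_states)]
pi = [w / sum(weights) for w in weights]  # strictly positive, sums to 1
# Sorting both lists puts them in the same order, so condition (ii) holds;
# distinct values in `a` together with positive `pi` give condition (iii).
a = sorted(random.random() for _ in range(n_states))
b = sorted(random.random() for _ in range(n_states))

lhs, rhs = both_sides_of_eq7(pi, a, b)
print(lhs, rhs)  # the two sides should agree up to rounding, and be positive
```

Sorting is only one convenient way to satisfy condition (ii); any pair of conditional probability vectors that are ordered alike across the states would serve equally well.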

With this result in hand, the proof of Proposition 1 now follows, by taking $E_1$
to be the event that taxon A is in state $X_k$, $E_2$ to be the event that taxon B is in
state $X_k$, and $C$ to be the state of the most recent common ancestor of A and B
under CA. Lemma 5, together with Assumptions 2(i), 3, and 5, shows that:

$$\Pr(\text{taxa A and B are in state } X_k \mid CA) > \qquad (8)$$

$$\Pr(\text{taxon A is in state } X_k \mid CA)\,\Pr(\text{taxon B is in state } X_k \mid CA), \qquad (9)$$

and Assumptions 2(ii), 2(iii), and 4 imply that $\Pr(\text{taxa A and B are in state } X_k \mid SA)$
is equal to $\Pr(\text{taxon A is in state } X_k \mid SA)\,\Pr(\text{taxon B is in state } X_k \mid SA)$, so that
the latter term, by Assumption 1, is equal to (9).

Proof of Proposition 2. Notice that we can write:

$$\rho_{DS/D}(t) = \frac{q\,(s + (1-s)\theta)^2 + (1-q)\,s^2\,(1-\theta)^2}{(s + (q-s)\theta)^2\,\left(1 + \left(\tfrac{1}{q} - 1\right)\theta^2\right)},$$

where $\theta = e^{-rt}$. Let $\Delta$ denote the numerator of $\rho_{DS/D}(t)$ minus the denominator.
Then tedious but straightforward algebra shows that:

$$\Delta = \left(\tfrac{1}{q} - 1\right)\theta\,(q - s)\,[\,s(\theta - 2\theta^2 + \theta^3) + q(\theta - \theta^3)\,].$$

Now, since $\theta \in (0,1)$ for $t > 0$, we have that $(\theta - 2\theta^2 + \theta^3) = \theta(1-\theta)^2 > 0$ and $(\theta - \theta^3) > 0$,
so the term in the square brackets in the last equation is strictly positive (since
$q > 0$) and $\tfrac{1}{q} - 1 > 0$ (unless $q = 1$, in which case $\rho_{DS/D}(t) = 1$ for all $t$).

Consequently, the sign of $\Delta$ when $q \neq s$ and $q \neq 1$ is exactly the sign of
$(q - s)$, which gives the result claimed (since $\rho_{DS/D}(t)$ is greater or smaller than 1
precisely when $\Delta$ is positive or negative). The proof of part (ii) of Proposition 2
is straightforward.

□ 
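The sign claim in this proof can be spot-checked numerically. The sketch below is illustrative only: it assumes the ratio has the form $\rho_{DS/D}(t) = [q(s+(1-s)\theta)^2 + (1-q)s^2(1-\theta)^2] / [(s+(q-s)\theta)^2(1+(1/q-1)\theta^2)]$ with $\theta = e^{-rt}$, and the function names and the grid of parameter values are hypothetical choices, not taken from the paper.

```python
import math


def rho_ds_over_d(q, s, r, t):
    """Ratio rho_{DS/D}(t), under the assumed form stated in the lead-in."""
    theta = math.exp(-r * t)
    numerator = q * (s + (1 - s) * theta) ** 2 + (1 - q) * s ** 2 * (1 - theta) ** 2
    denominator = (s + (q - s) * theta) ** 2 * (1 + (1 / q - 1) * theta ** 2)
    return numerator / denominator


def sign(value, tol=1e-12):
    """Return -1, 0 or 1, treating values within tol of zero as zero."""
    if abs(value) < tol:
        return 0
    return 1 if value > 0 else -1


# Hypothetical grid of (q, s, t) values with rate r = 1, for illustration only.
grid = [(q, s, t)
        for q in (0.2, 0.5, 0.8)
        for s in (0.1, 0.5, 0.9)
        for t in (0.1, 1.0, 3.0)]

# The proof says rho_{DS/D}(t) - 1 has the same sign as q - s.
agreements = [sign(rho_ds_over_d(q, s, 1.0, t) - 1) == sign(q - s)
              for q, s, t in grid]
print(all(agreements))
```

A grid check of this kind cannot replace the algebra, but it is a quick way to catch a transcription error in the displayed ratio.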

Proof of Proposition 3. For part (a), consider the difference $\delta$ between the
numerator and denominator of $\rho_{SS/D}(t)$. Then,

$$\delta = \left(e^{-2 r_{SS} t} - e^{-2 r_D t}\right) > 0,$$

for $t > 0$ since $r_D > r_{SS}$. Thus, $\rho_{SS/D}(t) > 1$ for all $t > 0$.
Now, any solution to the equation $\frac{d}{dt}\rho_{SS/D}(t) = 0$ satisfies:

$$r e^{st} - s e^{rt} + (r - s)\,\frac{1-p}{p} = 0, \qquad (10)$$

where, for brevity, we write $r = r_D$ and $s = r_{SS}$ (so $r > s$) here and in what
follows. To see that Eqn. (10) has a unique solution, notice that the left-hand side
is strictly positive when $t = 0$, and tends to $-\infty$ as $t$ grows; and since the derivative
of the left-hand side with respect to $t$ is $rs(e^{st} - e^{rt})$, which is strictly negative for
all $t > 0$, the left-hand side cuts the $t$-axis exactly once, and so equals zero for a
unique value $t^*$, as claimed.

When $r = 2s$, Eqn. (10) becomes (upon division through by $s$) the following
quadratic equation for $x = e^{st}$: $2x - x^2 = -\frac{1-p}{p}$, which has a unique solution
for $x > 1$, namely,

$$x = 1 + \sqrt{1 + (1-p)/p}, \qquad (11)$$

and from this we obtain an explicit expression for $M$, namely:

$$M = \frac{1 + (1-p)/(p x)}{1 + (1-p)/(p x^2)}, \qquad (12)$$

where $x$ is given by Eqn. (11). In case $p = 1/k$, Eqn. (11) gives:

$$x = \sqrt{k} + 1,$$

from which Eqn. (12) becomes, upon simplification:

$$M = \tfrac{1}{2}\left(\sqrt{k} + 1\right).$$

For part (ii), let $y = e^{-st}$. The assumptions that $r = \tau \cdot s$ and $p = 1/k$ imply
that $\rho_{SS/D}(t) = \frac{1 + (k-1)\,y}{1 + (k-1)\,y^{\tau}}$. This expression is maximized at the $t$ value for which $y$
satisfies the equation:

$$(k - 1)(\tau - 1)\,y^{\tau} + \tau\,y^{\tau - 1} - 1 = 0.$$

The solution to this last equation in the range $(0,1)$ is (asymptotically as $k$ grows)
given by $y \sim [\ldots]^{1/\tau}$, from which part (ii) now follows.


□ 
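The closed form $M = \tfrac{1}{2}(\sqrt{k} + 1)$ for the case $r = 2s$, $p = 1/k$ can be compared with a brute-force maximization. This is a rough sketch under stated assumptions: the ratio is taken to be $\rho_{SS/D}(t) = (1 + c\,e^{-st}) / (1 + c\,e^{-rt})$ with $c = (1-p)/p$, which is the form implied by Eqn. (12) with $x = e^{st}$ (a constant rescaling of $t$ would not change the maximum). The function names, grid size, and time horizon are illustrative choices.

```python
import math


def rho_ss_over_d(p, s, r, t):
    """Assumed form of rho_{SS/D}(t); s and r are the two substitution rates."""
    c = (1 - p) / p
    return (1 + c * math.exp(-s * t)) / (1 + c * math.exp(-r * t))


def max_ratio_closed_form(k):
    """M = (sqrt(k) + 1) / 2, the closed form for r = 2s and p = 1/k."""
    return 0.5 * (math.sqrt(k) + 1)


def max_ratio_numeric(k, s=1.0, t_max=20.0, steps=20000):
    """Maximize the ratio over an evenly spaced grid of t values in (0, t_max]."""
    p, r = 1.0 / k, 2.0 * s
    return max(rho_ss_over_d(p, s, r, t_max * i / steps)
               for i in range(1, steps + 1))


for k in (2, 4, 16, 100):
    print(k, max_ratio_numeric(k), max_ratio_closed_form(k))
```

The grid maximum can only approach the true maximum from below, so the numeric value should sit marginally under the closed form, with the gap shrinking as `steps` grows.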


Proof of Proposition 4. Suppose that $T$ is a rooted tree on $n \geq 2$ leaves, where
the root vertex $\rho$ is the most recent common ancestor of the leaves. Suppose that $T$ is
not a star tree. Then $n \geq 3$, and $T$ has a vertex $v$ that is adjacent to the root
of $T$, and which has edges to at least two other pendant subtrees $T_1, \ldots, T_k$. Let
$l_1$ and $l_2$ denote the lengths of the edges that connect the root to $v$ and $v$ to the
root of $T_1$, respectively. Consider the tree $T'$ obtained by reattaching $T_1$ directly
to the root of $T$, by an edge of length $l_1 + l_2$. We will show that $T'$ has a lower
probability that all its leaves are in state $X$ than $T$ does. It then follows that only
the star tree minimizes this probability.

Let $E_1$, $E_2$ and $F$ denote, respectively, the events that all the leaves in $T_1$, in
$T_2, \ldots, T_k$, and in the remainder of $T$, are in state $X$, and let $E$ denote the conjunction
of these three events (i.e. the event that all the leaves of $T$ are in state $X$). Let
$E'_1$, $E'_2$, $F'$ and $E'$ denote the corresponding events for tree $T'$. If $Y_\rho$ denotes
the state at the root of each tree, then, by the law of total probability:

$$\Pr(E) = \sum_y \mathrm{Pr}_y(E_1 \& E_2 \& F)\,\Pr(Y_\rho = y), \qquad (13)$$

and

$$\mathrm{Pr}'(E') = \sum_y \mathrm{Pr}'_y(E'_1 \& E'_2 \& F')\,\mathrm{Pr}'(Y_\rho = y), \qquad (14)$$

where $\Pr$ and $\mathrm{Pr}'$ denote probabilities computed on $T$ and $T'$ respectively, and
where $\mathrm{Pr}_y$ and $\mathrm{Pr}'_y$ denote (conditional) probabilities computed on $T$ and $T'$
respectively, conditional on the root-state event $Y_\rho = y$. Notice that $\Pr(Y_\rho = y) = \mathrm{Pr}'(Y_\rho = y)$.
Also, $\mathrm{Pr}_y(E_1 \& E_2 \& F) = \mathrm{Pr}_y(E_1 \& E_2)\,\mathrm{Pr}_y(F)$ and $\mathrm{Pr}'_y(E'_1 \& E'_2 \& F') = \mathrm{Pr}'_y(E'_1 \& E'_2)\,\mathrm{Pr}'_y(F')$
since $Y_\rho$ screens-off $E_1 \& E_2$ from $F$ in $T$, and also $E'_1 \& E'_2$
from $F'$ in $T'$. Moreover, $\mathrm{Pr}_y(F) = \mathrm{Pr}'_y(F')$ for all $y$. Thus, to establish that
$\mathrm{Pr}'(E') < \Pr(E)$, it suffices, by (13) and (14), to show that, for every state $y$:

$$\mathrm{Pr}_y(E_1 \& E_2) > \mathrm{Pr}'_y(E'_1 \& E'_2). \qquad (15)$$


Notice that $E_1$ and $E_2$ become conditionally independent once we specify the state
at vertex $v$, which we denote as $Y_v$ (this variable also screens-off $Y_\rho$ from $E_1$
and $E_2$). Now, for $i = 1$ and $i = 2$, we have $\mathrm{Pr}_y(E_i \mid Y_v = X) > \mathrm{Pr}_y(E_i \mid Y_v = X')$
for any state $X' \neq X$; moreover, the nature of the equal input model ensures that
$\mathrm{Pr}_y(E_i \mid Y_v = X') = \mathrm{Pr}_y(E_i \mid Y_v = X'')$ for any two states $X', X''$ that are different
from $X$. In addition, $\mathrm{Pr}_y(Y_v = X') > 0$ for $X' = X$ and for at least one other
state $X' \neq X$ (since $p \neq 0, 1$). Thus, we have satisfied conditions (i)-(iii) in the
general version of Reichenbach’s theorem (Lemma 5, taking $C = Y_v$) to deduce
that:

$$\mathrm{Pr}_y(E_1 \& E_2) > \mathrm{Pr}_y(E_1)\,\mathrm{Pr}_y(E_2). \qquad (16)$$

Turning to $\mathrm{Pr}'_y(E'_1 \& E'_2)$ we have:

$$\mathrm{Pr}'_y(E'_1 \& E'_2) = \mathrm{Pr}'_y(E'_1)\,\mathrm{Pr}'_y(E'_2).$$

Now, considering the right-hand side of this last equation, notice that:

$$\mathrm{Pr}'_y(E'_1) = \mathrm{Pr}_y(E_1) \quad \text{and} \quad \mathrm{Pr}'_y(E'_2) = \mathrm{Pr}_y(E_2).$$

Thus, $\mathrm{Pr}'_y(E'_1 \& E'_2) = \mathrm{Pr}_y(E_1)\,\mathrm{Pr}_y(E_2)$ which, by (16), establishes the required
Inequality (15).
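The reattachment step can be illustrated on the smallest non-trivial case, a three-leaf tree. The sketch below is illustrative and rests on assumptions not fixed by the text: the root state is drawn from the stationary distribution, the equal-input model is lumped into the two classes "X" and "not X" (which is exact because every other state moves to X with the same probability), and all function names and parameter values are hypothetical.

```python
import math


def trans(from_x, to_x, p, r, length):
    """Lumped equal-input transition probability between 'X' and 'not X'.

    With probability exp(-r * length) the state is copied unchanged; otherwise
    it is redrawn, landing in X with stationary probability p.
    """
    stay = math.exp(-r * length)
    prob_to_x = p * (1 - stay) + (stay if from_x else 0.0)
    return prob_to_x if to_x else 1 - prob_to_x


def prob_all_x_resolved(p, r, l1, l2):
    """Tree T: root -> v (l1); v -> A and v -> B (l2 each); root -> C (l1 + l2)."""
    total = 0.0
    for root_x in (True, False):
        pr_root = p if root_x else 1 - p
        # Sum over the state at v; A and B are independent given that state.
        via_v = sum(trans(root_x, v_x, p, r, l1) * trans(v_x, True, p, r, l2) ** 2
                    for v_x in (True, False))
        total += pr_root * via_v * trans(root_x, True, p, r, l1 + l2)
    return total


def prob_all_x_star(p, r, l1, l2):
    """Tree T': A, B and C each attached to the root by an edge of length l1 + l2."""
    return sum((p if root_x else 1 - p) * trans(root_x, True, p, r, l1 + l2) ** 3
               for root_x in (True, False))


p, r, l1, l2 = 0.25, 1.0, 0.5, 0.5  # hypothetical values
print(prob_all_x_resolved(p, r, l1, l2), prob_all_x_star(p, r, l1, l2))
```

For these values the resolved tree should give the larger probability, in line with the claim that moving a subtree to the root lowers the probability that all leaves are in state X; setting `l1 = 0` collapses the two trees into one and the two functions agree.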


For the result concerning the delayed tree we use an equivalent description
of the equal input model sometimes referred to as the ‘Fortuin-Kasteleyn’
random cluster model (see Section 2.1 of Matsen, Mossel and Steel 2008). Let
$C \in \{1, 2, \ldots, n\}$ be the number of clusters (blocks of the partition of the set of
leaves of $T$ induced by an independent Poisson process that acts with intensity $r$
along the edges of the tree; the partition regards two leaves as being in the same
block if the path between them does not cross an edge on which the Poisson event
has occurred). Here $r$ is the substitution rate, divided by 1 minus the sum of the
squares of the stationary probabilities of the states. Then if $\varphi_X$ denotes the
probability that, under the equal input model, all $n$ leaves of $T$ are in state $X$, and
if $p$ denotes the stationary probability of state $X$, the random cluster description
allows us to write $\varphi_X$ as follows:

$$\varphi_X = E[p^C] = \sum_{i=1}^{n} \Pr(C = i)\,p^i. \qquad (17)$$

Thus,

$$\varphi_X \leq p \cdot P(C = 1) + p^2 \cdot P(C > 1).$$

Noting again that $P(C = 1) = e^{-rL}$, we get:

$$\varphi_X \leq p\,e^{-rL} + p^2\,(1 - e^{-rL}) = p^2 + p(1-p)\,e^{-rL}. \qquad (18)$$

Now, $L \geq 2t$ with equality if and only if the tree is a delayed tree. It follows that
$\varphi_X \leq p^2 + p(1-p)\,e^{-2rt}$. Dividing this again by $p^n$ we arrive at the upper
bound on LR given by the expression for the delayed tree. □
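Eqn. (17) and the bound (18) can be checked on a star tree, where both the cluster count and the direct calculation are easy to write down. This is a minimal sketch under assumptions that go beyond the text: the tree is a star with $n$ pendant edges of equal length $t$ (so the total length is $L = nt$), the root state is drawn from the stationary distribution, and the function names and parameter values are illustrative.

```python
import math


def phi_direct(p, theta, n):
    """Pr(all n leaves are in X) on a star tree, summing over the root state.

    theta = exp(-r * t) is the probability of no Poisson event on one edge.
    """
    return p * (p + (1 - p) * theta) ** n + (1 - p) * (p * (1 - theta)) ** n


def phi_cluster(p, theta, n):
    """E[p^C] as in Eqn. (17), with C the number of clusters on the star tree."""
    total = 0.0
    for j in range(n + 1):  # j = number of pendant edges with no Poisson event
        prob_j = math.comb(n, j) * theta ** j * (1 - theta) ** (n - j)
        # Leaves cut off by an event are singletons; the j uncut leaves share
        # one block through the central vertex (when j >= 1).
        clusters = (n - j) + (1 if j >= 1 else 0)
        total += prob_j * p ** clusters
    return total


def bound_18(p, r, total_length):
    """Right-hand side of (18): p^2 + p (1 - p) exp(-r L)."""
    return p ** 2 + p * (1 - p) * math.exp(-r * total_length)


p, r, t, n = 0.25, 1.0, 0.7, 5  # hypothetical values
theta = math.exp(-r * t)
print(phi_direct(p, theta, n), phi_cluster(p, theta, n), bound_18(p, r, n * t))
```

The first two numbers should coincide, and both should lie below the third; for $n = 2$ the bound is attained exactly, since then $C$ can only be 1 or 2.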